| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | names | 0 | 1.0000000 | 1 | 98 | 0 | 9660 | 0 | NA | NA | NA | NA | NA | NA | NA |
| character | date_x | 0 | 1.0000000 | 10 | 10 | 0 | 5688 | 0 | NA | NA | NA | NA | NA | NA | NA |
| character | genre | 85 | 0.9916487 | 3 | 86 | 0 | 2303 | 0 | NA | NA | NA | NA | NA | NA | NA |
| character | overview | 0 | 1.0000000 | 12 | 998 | 0 | 9905 | 0 | NA | NA | NA | NA | NA | NA | NA |
| character | crew | 56 | 0.9944979 | 8 | 1357 | 0 | 9927 | 0 | NA | NA | NA | NA | NA | NA | NA |
| character | orig_title | 0 | 1.0000000 | 1 | 86 | 0 | 9730 | 0 | NA | NA | NA | NA | NA | NA | NA |
| character | status | 0 | 1.0000000 | 8 | 15 | 0 | 3 | 0 | NA | NA | NA | NA | NA | NA | NA |
| character | orig_lang | 0 | 1.0000000 | 4 | 35 | 0 | 54 | 0 | NA | NA | NA | NA | NA | NA | NA |
| character | country | 0 | 1.0000000 | 2 | 2 | 0 | 60 | 0 | NA | NA | NA | NA | NA | NA | NA |
| numeric | score | 0 | 1.0000000 | NA | NA | NA | NA | NA | 6.349705e+01 | 1.353701e+01 | 0 | 59 | 65 | 71 | 100 |
| numeric | budget_x | 0 | 1.0000000 | NA | NA | NA | NA | NA | 6.488238e+07 | 5.707565e+07 | 1 | 15000000 | 50000000 | 105000000 | 460000000 |
| numeric | revenue | 0 | 1.0000000 | NA | NA | NA | NA | NA | 2.531401e+08 | 2.777880e+08 | 0 | 28588985 | 152934876 | 417802077 | 2923706026 |
Predicting Box Office Success: A Machine Learning Approach
Data Science 2 with R (STAT 301-2)
Introduction
This report presents the findings of predictive modeling on a movies dataset. The central question is how well a movie’s revenue can be predicted from the movie’s other features.
An accurate prediction model for movie revenue could be highly valuable to investors and filmmakers because it would help estimate a film’s financial success before release, allowing for better budget allocation and more effective marketing strategies.
Data Overview
The skim table at the top of this report shows the attributes of our movies dataset, such as each column’s type and number of missing values. It shows that there are very few missing values, and that they are confined to the genre and crew columns. The main purpose of this check is to detect missingness in the target variable, and none is present in this data.
The table below provides a more concise summary of the data: 10,178 observations, 12 variables, and 126 rows with at least one missing value.
| Metric | Value |
|---|---|
| Rows with Missing Values | 126 |
| Number of Observations | 10178 |
| Number of Variables | 12 |
| Missing Target Variable Observations (revenue) | 0 |
On the left, we have the original distribution of the target variable, revenue, visualized with a density plot and box plot. The original values were heavily skewed right, so it was apparent a transformation of revenue was needed. After experimenting with log, Yeo-Johnson, square root, and Box-Cox transformations, I settled on a Yeo-Johnson transformation to make the values more evenly distributed.
The Yeo-Johnson transformation reshapes data to make it more normal by adjusting values differently based on whether they are positive or negative. For positive numbers, it applies a power transformation or a log function, while for negative numbers, it flips them, transforms them, and flips them back. The density and box plots on the right show those values post-transformation.1
1 A lambda value of 0.25 was used.
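Since the fitted lambda is 0.25 and revenue is never negative, the transformation reduces to its non-negative branch. Below is a minimal sketch (in Python, as a stand-in for the R recipe step actually used), applied to the revenue quartiles from the skim table at the top of this report:

```python
import numpy as np

def yeo_johnson(x, lmbda=0.25):
    # Yeo-Johnson for non-negative x with lmbda != 0 (revenue is never
    # negative, and the report's fitted lambda is 0.25):
    #   y = ((x + 1)**lmbda - 1) / lmbda
    x = np.asarray(x, dtype=float)
    return ((x + 1.0) ** lmbda - 1.0) / lmbda

# Revenue min / quartiles / max from the data-overview skim table:
revenue = np.array([0.0, 28588985, 152934876, 417802077, 2923706026])
print(np.round(yeo_johnson(revenue), 1))
```

The resulting values (0, ~288.5, ~440.8, ~567.9, ~926.1) match the yeo_revenue summary statistics in the Data Wrangling skim table, confirming the lambda of 0.25.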
Data Wrangling
The movies dataset initially had 12 columns, from which I was originally able to extract only 3 predictors for my models. Since then, I have conducted more manipulation of the data to produce more usable predictors. Below is a glimpse of the manipulated dataset.2
2 For the sake of understanding, only the first 2 columns are really relevant; refer to movies_clean_codebook.csv for clearer information on what each variable is.
| skim_type | skim_variable | n_missing | complete_rate | Date.min | Date.max | Date.median | Date.n_unique | character.min | character.max | character.empty | character.n_unique | character.whitespace | factor.ordered | factor.n_unique | factor.top_counts | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | num_genres | 10178 | 0.0000000 | Inf | -Inf | NA | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | names | 0 | 1.0000000 | NA | NA | NA | NA | 1 | 98 | 0 | 9660 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | score | 0 | 1.0000000 | NA | NA | NA | NA | 1 | 3 | 0 | 79 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | overview | 0 | 1.0000000 | NA | NA | NA | NA | 12 | 998 | 0 | 9905 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | crew | 56 | 0.9944979 | NA | NA | NA | NA | 8 | 1357 | 0 | 9927 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | orig_title | 0 | 1.0000000 | NA | NA | NA | NA | 1 | 86 | 0 | 9730 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | status | 0 | 1.0000000 | NA | NA | NA | NA | 8 | 15 | 0 | 3 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | budget_x | 0 | 1.0000000 | NA | NA | NA | NA | 1 | 11 | 0 | 2316 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | negative | 660 | 0.9351543 | NA | NA | NA | NA | 1 | 2 | 0 | 25 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| factor | orig_lang | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | FALSE | 54 | Eng: 7417, Jap: 714, Spa: 397, Kor: 388 | NA | NA | NA | NA | NA | NA | NA |
| factor | num_crew | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | FALSE | 12 | 9: 8729, 7: 211, 8: 207, 6: 202 | NA | NA | NA | NA | NA | NA | NA |
| numeric | genre | 10178 | 0.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | NA | NA | NA | NA | NA | NA |
| numeric | revenue | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.531401e+08 | 2.777880e+08 | 0 | 2.858898e+07 | 1.529349e+08 | 4.178021e+08 | 2.923706e+09 |
| numeric | country | 10178 | 0.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | NA | NA | NA | NA | NA | NA |
| numeric | positive | 660 | 0.9351543 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.955138e+00 | 2.728433e+00 | 0 | 0.000000e+00 | 1.000000e+00 | 3.000000e+00 | 4.000000e+01 |
| numeric | overall_sentiment | 10178 | 0.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | NA | NA | NA | NA | NA | NA |
| numeric | date | 10178 | 0.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | NA | NA | NA | NA | NA | NA |
| numeric | yeo_revenue | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.205222e+02 | 1.742970e+02 | 0 | 2.884891e+02 | 4.408224e+02 | 5.678770e+02 | 9.261295e+02 |
Methods
Data Splitting
I decided on a split proportion of 80/20 because it provides a good balance of data between the training and testing sets. I also used the default number of strata and stratified by our target variable, yeo_revenue (the Yeo-Johnson transformed revenue).
Resampling
I will be using V-fold cross-validation for resampling, with 5 folds (V) and 3 repeats. This method is useful because it ensures that each data point is used for both training and testing, providing a more reliable estimate of model performance.
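As an illustrative sketch (using scikit-learn's RepeatedKFold rather than the tidymodels vfold_cv actually used), 5 folds with 3 repeats yields 15 resamples, which is why n = 15 appears in the results tables later:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# 5-fold cross-validation repeated 3 times: every observation lands in an
# assessment fold exactly once per repeat, giving 5 * 3 = 15 resamples.
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
X = np.arange(100).reshape(-1, 1)  # toy data standing in for the movies set
splits = list(rkf.split(X))
print(len(splits))  # 15 train/assessment index pairs
```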
Metrics
The primary metric I used for comparing and selecting a final model is Mean Absolute Error (MAE). MAE measures the average magnitude of the errors between the predicted and actual values, without considering their direction. I picked MAE as my metric because it is commonly used in regression modeling and is a relatively simple metric to interpret and explain.
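For reference, MAE can be computed in a few lines (a generic sketch, not the R metric function the report relied on):

```python
import numpy as np

def mae(actual, predicted):
    # Average unsigned gap between predictions and truth, in the target's units.
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))

print(mae([100, 200, 300], [110, 190, 330]))  # (10 + 10 + 30) / 3 ≈ 16.67
```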
Model Descriptions
This is a regression problem because the target variable, revenue, is continuous rather than categorical. Since it is a regression problem, I knew I would like to fit/tune a Null/Baseline model, an OLS model, Elastic Net (EN) models, a Random Forest model, a Boosted Trees model, and a K-Nearest Neighbors model.
Null/Baseline
A null model is a simple model that predicts based solely on the mean of the target variable (for regression). It serves as a basic reference to compare the performance of more complex models.
A baseline model is a simple predictive model used as a benchmark. Its use is to set a reasonable lower bound for predictive performance. Comparing against a baseline helps measure whether advanced modeling techniques provide meaningful improvement.
OLS
An Ordinary Least Squares (OLS) model is used when relationships are linear, features are independent, and interpretability is needed. An OLS model captures linear relationships, effect sizes, and predictor significance well, and is almost always one of the first models to be used in a regression problem. It provides another fairly simple benchmark for our models.
Elastic Net
Elastic Net (EN) is a regularized regression method that combines Ridge and Lasso penalties. It stabilizes coefficient estimates like Ridge while also eliminating irrelevant variables like Lasso. EN is particularly useful when predictors are highly correlated, when there are more features than observations, or when handling noisy or sparse data.
In EN, the penalty controls how much the model shrinks the coefficients, with higher values forcing them closer to zero. The mixture decides the balance between Lasso, which can set some coefficients to zero, and Ridge, which shrinks them without making any exactly zero. A mixture of 1 is pure Lasso, 0 is pure Ridge, and values in between blend both.
The hyperparameters I will be tuning are mixture and penalty.
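A hedged scikit-learn sketch of the same idea on synthetic data (here `alpha` plays the role of penalty and `l1_ratio` the role of mixture; this is an assumed mapping, not the tidymodels code the report used):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=200)  # only the first feature matters

# penalty = 0.5, mixture = 1 (pure Lasso) -- the values the basic workflow selected.
model = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X, y)
print(model.coef_)  # the Lasso penalty zeroes out the irrelevant coefficients
```

With mixture = 1 the penalty is pure Lasso, which is exactly what the tuning in this report ended up selecting.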
Random Forest
Random Forest builds multiple decision trees and averages their predictions to improve accuracy and reduce overfitting. It is ideal for capturing nonlinear relationships and handling high-dimensional data.
In Random Forest, mtry is the number of features randomly chosen at each split, affecting how diverse the trees are. min_n is the minimum number of data points needed in a node before it can split, controlling how deep the trees grow.
The hyperparameters I will be tuning are min_n with a range of 2-40, and mtry with a range of 1-10.
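A rough scikit-learn analogue of these two knobs (an assumed mapping: `max_features` for mtry and `min_samples_leaf` for min_n; synthetic data, not the movies set):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data with 10 features, mirroring the mtry upper bound.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=1)

# mtry = 10 features tried per split, min_n = 11 observations per leaf --
# the best values the basic workflow landed on.
rf = RandomForestRegressor(
    n_estimators=200, max_features=10, min_samples_leaf=11, random_state=1
).fit(X, y)
preds = rf.predict(X)
```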
Boosted Trees
Boosted trees is a method that builds decision trees sequentially, where each tree corrects the errors of the previous one, leading to high predictive accuracy and strong performance on complex datasets.
In Boosted Trees, we have mtry and min_n as hyperparameters as well. learn_rate controls how much each tree contributes to the final prediction, with lower values leading to slower but more stable learning. tree_depth sets how deep each tree can grow, balancing complexity and overfitting.
The hyperparameters I will be tuning are min_n with a range of 2-40, mtry with a range of 1-10, learn_rate with a range of -5 to -0.2 (on the log10 scale), and tree_depth with a range of 6 to 10.
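The learn_rate bounds are on the log10 scale (the tidymodels convention), so the actual rates span 10^-5 up to 10^-0.2; notably the selected value reported below, 0.0398107, is exactly 10^-1.4. A quick sketch:

```python
import numpy as np

# learn_rate is tuned on the log10 scale: -5 to -0.2 means rates from
# 10**-5 = 0.00001 up to 10**-0.2 ~= 0.63.
log_bounds = (-5.0, -0.2)
grid = 10.0 ** np.linspace(*log_bounds, num=5)
print(grid)

# The selected learn_rate corresponds to a log10 value of -1.4:
print(10.0 ** -1.4)  # ~= 0.0398107
```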
K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a non-parametric model that predicts values from the average of the nearest data points (for regression) or their majority vote (for classification). It works well for nonlinear relationships.
The hyperparameter I will be tuning is neighbors.
In K-Nearest Neighbors (KNN), the neighbors parameter sets how many nearby points the model looks at when making a prediction. A small value makes the model sensitive to noise, while a large value smooths predictions but may miss local patterns. Choosing the right value balances accuracy and generalization.
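A toy sketch of that trade-off (synthetic one-dimensional data, not the movies set):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=80)

# k = 1 memorizes the training data (zero training error = sensitive to noise);
# k = 10 averages over a neighborhood, smoothing the noise away.
train_mae = {}
for k in (1, 10):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    train_mae[k] = np.mean(np.abs(knn.predict(X) - y))
print(train_mae)  # training MAE at k = 1 is ~0; k = 10 trades that for smoothness
```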
There are no hyperparameters to tune for the Null/Baseline or OLS models, as those models are fit to the resamples as normal.
Recipes
Two recipes are defined for each model (or set of models, in the case of the EN models), excluding the Null/Baseline models, which use only one recipe: a “basic” recipe and a “complex” recipe.
Null/Baseline Model Recipe
The first recipe was for the null and baseline models. My first step was to impute the mean of all missing numerical values. I also imputed the mode of the overall_sentiment column because of NAs. Next, I extracted the year and month out of the date column to use as predictors.
Linear Model Recipes
For my complex Linear model recipe, most of the steps are the same as the null/baseline recipe, except that I created interactions between 2 sets of predictors I feel would work together in this predictive process, and scaled and centered the numerical predictors to reduce multicollinearity. I also added a Yeo-Johnson transformation of the budget_x column. My basic Linear model recipe has everything previously stated except the interactions and the budget_x transformation.
Tree Based Model Recipes
For my basic tree model recipe, I kept the same steps as my basic Linear model recipe. For my complex recipe, I created a new column called season that splits up the months into the 4 seasons and used that as a predictor as well.
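The month-to-season mapping can be sketched as follows (the exact month boundaries are an assumption; the report does not state them):

```python
def month_to_season(month):
    # Map a release month (1-12) to one of the 4 seasons. The December-through-
    # February winter grouping is an assumed convention, not taken from the report.
    seasons = {12: "winter", 1: "winter", 2: "winter",
               3: "spring", 4: "spring", 5: "spring",
               6: "summer", 7: "summer", 8: "summer",
               9: "fall", 10: "fall", 11: "fall"}
    return seasons[month]

print(month_to_season(7))  # summer
```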
Model Building and Selection Results
Basic Workflows
| mtry | min_n | Model_Type |
|---|---|---|
| 10 | 11 | Random Forest |
The Random Forest workflow’s best hyperparameters are an mtry of 10 and a min_n of 11. The MAE declined steeply at first and began to level out at an mtry of 5.
| neighbors | Model_Type |
|---|---|
| 10 | K-Nearest Neighbor |
Figure 3 and Table 5 show that KNN’s best hyperparameter is a neighbors value of 10.
| mtry | min_n | tree_depth | learn_rate | Model_Type |
|---|---|---|---|---|
| 10 | 2 | 9 | 0.0398107 | Boosted Trees |
Figure 9 (a) has 5 rows of plots, with the first row having a tree depth of 6, the next a tree depth of 7, and so on. Across each row, the learn rate and number of randomly selected predictors (mtry) increase. The different min_n values are hard to discern in the plots because they mostly overlap. As learn_rate and mtry increase, the MAE gradually decreases, settling upon an mtry of 10, a min_n of 2, a tree_depth of 9, and a learn_rate of ~0.0398 as the best hyperparameters.
| penalty | mixture | Model_Type |
|---|---|---|
| 0.5 | 1 | Elastic Net |
EN’s best hyperparameters are a mixture of 1 and a penalty of 0.5, which is a Lasso model.
Figure 6 shows the distribution of the MAEs of the “basic” model workflows, along with the null and baseline models. The sharp decline from the null model to the rest of the models is evidence that more complex models and recipes are needed to produce better predictions. Among the remaining models the MAE continues to decrease, though the change between some of them is small. Since the differences between the Baseline, KNN, OLS, and EN models are not very significant, we learn that we should be pickier about the models we tune/fit, since the computational cost is not worth it for those models. On the other hand, Random Forest and Boosted Trees show a significant decrease from the 4 aforementioned workflows, which shows that tuning those models was helpful and worth the longer run times.
| Model Type | MAE | Std Error | n |
|---|---|---|---|
| bt | 73.1161 | 0.3557 | 15 |
| rf | 73.9316 | 0.2997 | 15 |
| knn | 93.5128 | 0.2134 | 15 |
| en | 93.7335 | 0.2472 | 15 |
| lm | 93.7535 | 0.2453 | 15 |
| baseline | 93.7535 | 0.2453 | 15 |
| null | 149.3789 | 0.2950 | 15 |
Table 8 shows the more precise MAEs for the “basic” workflows. The MAE of the null model is significantly higher than the rest, at about 149. KNN has an MAE of about 93.5. The baseline, EN, and Linear models are all around the same MAE value of about 93.7. Then, Random Forest and Boosted Trees show a significant drop, to around 73.9 and 73.1, respectively.
Complex Workflows
| mtry | min_n | Model_Type |
|---|---|---|
| 10 | 11 | Random Forest |
The complex Random Forest workflow’s best hyperparameters are an mtry of 10 and a min_n of 11.
| neighbors | Model_Type |
|---|---|
| 10 | K-Nearest Neighbor |
The complex KNN workflow’s best hyperparameter is a neighbors value of 10.
| mtry | min_n | tree_depth | learn_rate | Model_Type |
|---|---|---|---|---|
| 10 | 2 | 8 | 0.0398107 | Boosted Trees |
The complex Boosted Trees workflow produced very similar results to those shown in Figure 9 (a), though the best tree_depth hyperparameter is 8 instead of 9.
| penalty | mixture | Model_Type |
|---|---|---|
| 0.75 | 1 | Elastic Net |
The complex EN workflow’s best hyperparameters are a penalty of 0.75 and a mixture of 1, which is again a Lasso model.
Figure 11 shows the distribution of the MAEs of the “complex” model workflows, along with the null and baseline models. The plot paints a very similar picture to the “basic” workflows, but the most notable difference is that the EN and OLS models perform noticeably better than their basic counterparts, dropping from about 93.7 to about 85.7 MAE. That suggests the feature engineering I conducted in the complex linear recipes (the interactions and the budget transformation) helped those models.
| Model Type | MAE | Std Error | n |
|---|---|---|---|
| bt | 73.6254 | 0.2962 | 15 |
| rf | 74.6195 | 0.3022 | 15 |
| en | 85.7130 | 0.2683 | 15 |
| lm | 85.7521 | 0.2622 | 15 |
| knn | 93.5641 | 0.2176 | 15 |
| baseline | 93.7535 | 0.2453 | 15 |
| null | 149.3789 | 0.2950 | 15 |
Boosted Trees and Random Forest have the lowest MAEs, and they are within one standard error of each other, so we could pick either model for the final analysis. Since the MAE for the Boosted Trees model is lower in both sets of workflows, I will be picking Boosted Trees as the best model. Additionally, since the basic Boosted Trees workflow has a lower MAE than the complex one, I will specifically be using that workflow for the final model analysis.
Differences in Performance Between Model Types and Recipes
The EN and OLS models were fit with the same set of recipes, a more “basic” one and a more “complex” one. Their performance is nearly identical, which makes sense since they are both linear regression models fit with the same recipes. What interests me is that the Baseline model performs very similarly to them in the basic workflow, even though it was fit with a different recipe. Granted, for the basic EN and OLS workflows, the recipes used were not actually different from the null/baseline recipe used to fit the baseline model. But for the complex EN and OLS recipes, I added interactions and a transformation of the budget column, which lowered their MAEs below the baseline model’s.
Another interesting point is the KNN model. Though I used the same set of recipes to fit the KNN, Boosted Trees, and Random Forest workflows, KNN performed significantly worse than the other two. That could mean that I did not perform adequate tuning, that the recipe needs changes, or that the predictors and/or the data as a whole do not work well with the model. In the future, it could be meaningful to change the way I tuned the KNN model or find different ways to change the recipe, though those changes are no guarantee the KNN model would perform better.
In the end, I am not surprised Boosted Trees and Random Forest did the best since their capabilities in predictive modeling are more robust and overall capture many different aspects of the data when producing predictions.
Final Model Analysis
After training the final model (Boosted Trees) and predicting on the test set, I calculated MAE, as well as R-squared and RMSE to provide more context on the model’s predictions. Additionally, since we used yeo_revenue as the target variable and not the original revenue, I will be transforming the values back to the original scale so that meaningful deductions can be made with the MAE.
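Because lambda = 0.25 and revenue is non-negative, the back-transform has a closed form; below is a sketch (my own derivation of the inverse, not code from the report):

```python
def inverse_yeo_johnson(y, lmbda=0.25):
    # Invert y = ((x + 1)**lmbda - 1) / lmbda for non-negative x:
    #   x = (lmbda * y + 1)**(1 / lmbda) - 1
    return (lmbda * y + 1.0) ** (1.0 / lmbda) - 1.0

# Round trip through the forward transform at the median revenue:
x = 152934876.0
y = ((x + 1.0) ** 0.25 - 1.0) / 0.25
print(round(inverse_yeo_johnson(y)))  # 152934876
```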
RMSE (Root Mean Squared Error) measures how far a model’s predictions are from actual values, with lower values meaning better accuracy. R² (R-squared) shows how well the model explains variation in the data, ranging from 0 to 1, where higher values mean better fit. RMSE focuses on error size, while R² measures how much of the data’s pattern the model captures.
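Both metrics can be computed directly (a generic sketch with made-up numbers, not the report’s predictions):

```python
import numpy as np

def rmse_and_r2(actual, predicted):
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = actual - predicted
    rmse = np.sqrt(np.mean(err ** 2))               # typical error size, in target units
    ss_res = np.sum(err ** 2)                       # unexplained variation
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot                      # share of variation explained
    return rmse, r2

rmse, r2 = rmse_and_r2([10, 20, 30, 40], [12, 18, 33, 39])
print(rmse, r2)
```

Because RMSE squares the errors before averaging, it is always at least as large as MAE on the same predictions, which is why the two values differ in the tables below.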
| Metric | Value |
|---|---|
| RMSE | 98.8236043 |
| MAE | 73.1188859 |
| R² | 0.6735839 |
Table 14 shows that the RMSE is 98.8, while the MAE is 73.1. When calculating RMSE and MAE, we are looking for lower values, which show that our predicted values are not too far off from the actual values. The model produces a better MAE than RMSE, which is expected because RMSE squares the errors, so larger errors are penalized more heavily.
| Metric | Value |
|---|---|
| RMSE | 170,343,822.1188754 |
| MAE | 100,211,340.5876970 |
| R² | 0.6672412 |
On the original scale in Table 15, the numbers are very high (though that can be expected, since movie revenue can reach the hundreds of millions). The original scale allows us to better interpret the model’s predictions, showing that the RMSE of 98.8 corresponds to about $170 million and the MAE of 73.1 to about $100.2 million. The R-squared also drops slightly because of the scaling change, which is not a good sign, as it was already only 66.7%. That means the model explains only 66.7% of the variability in the results, which is not a very good number. We would prefer it to be higher, as that would mean the model is capturing the variance well.
Figure 12 shows the predicted revenue against the actual revenue on the Yeo-Johnson scale. The Boosted Trees model was able to make some correct or nearly correct predictions, but many predictions are still far off from the actual values.
Figure 13 has the predicted revenue with the actual revenue on the original scale.
To better see the outcomes of the model, Figure 14 shows the distribution from Figure 13 restricted to between 0 and 500,000,000. This plot shows a bit more over-prediction than under-prediction (at least in this region), which can make sense since Boosted Trees models can be prone to overfitting.
Figure 15 shows the distribution from Figure 12 restricted to between 250 and 500, one of the clustered regions on the full plot. The less-than-ideal performance of the model was foreshadowed by the R-squared value, which could mean that this prediction problem is difficult to capture in modeling. There could have been inadequate feature engineering as well, which would be another area of focus when exploring the reasons for the poor predictions.
Conclusion
In this modeling process, I feel my data wrangling, feature engineering, and tuning do not adequately capture my prediction problem.3 I find that surprising because, in theory, other aspects of a movie should give insight into how it will perform upon release, and I figure that they do, but my method is not really showing that. The results of these models are not objectively bad, as they could have performed much worse, but there is definitely room to explore this type of problem with different datasets in the hope of reaching better results. I believe I can say that the features of a movie can and do predict the movie’s revenue, but maybe not as strongly as I had thought in the beginning. Still, it is a great place to start, and in the future I could use more detailed aspects of a movie and its background. Questions like “Who produced the movie?”, “Is it part of a franchise?”, or “Is it a shorter or longer film?” are good things to ask. I would also like to look at more specific datasets that possibly have a restricted range of revenue, so that it is easier to capture and interpret.
3 How well a movie’s revenue can be predicted using other features of the movie/data.
Comment on Generative AI Use
I utilized AI (specifically ChatGPT) to help myself better understand the different models I used, as well as to gain a deeper understanding of the Yeo-Johnson transformation done on the target variable. I prompted it to describe each of them in simple terms to allow me to write the sections that explained what each model is/does and what specific types of manipulation are done to certain values in a Yeo-Johnson transformation.